Exploratory Data Analysis of Red Wine Quality by Xiaolan Yuan

Univariate Plots Section

The data set in this exploratory data analysis contains 1599 red wine samples, each with a quality rating and 11 physicochemical attributes. Together with the index column ‘X’ and the derived variable ‘citric_over_fixed.acidity’ introduced below, the data frame has 14 columns.

## [1] 1599   14

The name and type of each variable are shown below, followed by summary statistics:

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity            : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity         : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid              : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar           : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides                : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide      : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide     : num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density                  : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                       : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates                : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol                  : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality                  : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ citric_over_fixed.acidity: num  0 0 0.00513 0.05 0 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      citric_over_fixed.acidity
##  Min.   : 8.40   Min.   :3.000   Min.   :0.00000          
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.:0.01292          
##  Median :10.20   Median :6.000   Median :0.03291          
##  Mean   :10.42   Mean   :5.636   Mean   :0.03084          
##  3rd Qu.:11.10   3rd Qu.:6.000   3rd Qu.:0.04503          
##  Max.   :14.90   Max.   :8.000   Max.   :0.13929

We would like to explore the distribution of the red wine samples over each variable by plotting histograms. First, we take a look at the four variables linked to acidity: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’ and ‘pH’. We also want to know whether the proportion of citric acid in the fixed acids matters to the red wine quality. Hence, we add one more variable, ‘citric_over_fixed.acidity’, to our data frame.
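The derived variable can be computed with a single vectorised division. The sketch below is a minimal example, assuming the data frame is called `redwine` (the name that appears in the model calls later in this report). It uses only the first four samples from the structure listing rather than the full file, and the resulting ratios agree with the values listed there (0, 0, 0.00513, 0.05).

```r
# First four samples, copied from the structure listing above.
redwine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2),
  citric.acid   = c(0, 0, 0.04, 0.56)
)

# Ratio of citric acid to fixed acidity, computed row by row.
redwine$citric_over_fixed.acidity <-
  redwine$citric.acid / redwine$fixed.acidity

print(round(redwine$citric_over_fixed.acidity, 5))
```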

By the four figures above, we observed that:

Next, we will conduct a preliminary exploration of the remaining variables.

We have analyzed the remaining variables above. The figures indicate:

Univariate Analysis

What is the structure of your dataset?

This dataset has 1599 observations and 13 original variables (including the index ‘X’). The variables can be divided into 4 groups:

  • variables linked to acids: ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’, ‘pH’

  • variables linked to other main components: ‘residual.sugar’, ‘alcohol’, ‘density’

  • variables linked to additives: ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’

  • main variable: ‘quality’

Here are some main conclusions from the above plots:

  • The volatile acidity varies mostly from 0.1 to 1.0, with a few outliers above 1.0.

  • A significant proportion of the red wine samples contain no citric acid. The citric acid concentration is potentially highly related to the pH value.

  • The ‘residual.sugar’ distribution has a high peak around 2 to 3. The ‘alcohol’ distribution varies mostly from 9 to 14, with a high peak around 9.5.

  • The distributions of ‘chlorides’ and ‘sulphates’ have relatively wide ranges.

What is/are the main feature(s) of interest in your dataset?

In this analysis, the main features of interest are ‘quality’ and ‘pH’.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

  • Other variables such as ‘citric.acid’, ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’, ‘sulphates’ would also be important for a predictive model for ‘quality’.

  • ‘fixed.acidity’, ‘volatile.acidity’, ‘citric.acid’ would be helpful to investigate ‘pH’.

Did you create any new variables from existing variables in the dataset?

Yes, the variable ‘citric_over_fixed.acidity’ is the ratio of ‘citric.acid’ to ‘fixed.acidity’.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Many variables, such as ‘residual.sugar’, ‘alcohol’, ‘chlorides’, ‘free.sulfur.dioxide’, ‘total.sulfur.dioxide’ and ‘sulphates’, have long-tailed distributions. We transformed the x-axis to a ‘log10’ scale to make these distributions more symmetric.
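The effect of the log10 scale can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses synthetic log-normal data as a stand-in for a long-tailed variable (not values from the wine data), and `skewness` is a hypothetical helper defined here, not a function used in the original analysis. It transforms the values directly, which changes the shape of the histogram in the same way as rescaling the x-axis.

```r
set.seed(1)

# Synthetic long-tailed sample standing in for a variable such as
# 'residual.sugar'; these are NOT values from the wine data.
x <- rlnorm(1000, meanlog = log(2.2), sdlog = 0.5)

# Hypothetical helper: sample skewness (close to 0 for a symmetric sample).
skewness <- function(v) {
  mean((v - mean(v))^3) / sd(v)^3
}

skew_raw <- skewness(x)         # right-skewed on the original scale
skew_log <- skewness(log10(x))  # much closer to 0 after the transform

# Histogram bin counts on the log10 scale, computed without drawing a plot.
h <- hist(log10(x), breaks = 30, plot = FALSE)

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero on the log10 scale corresponds to the bell-like, symmetric histograms described in this report.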

Bivariate Plots Section

Here we will investigate the relationship between each pair of variables using the correlation matrix below.
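A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below is a minimal example that uses only four columns and the first ten samples from the structure listing earlier in the report, so its coefficients will differ from those of the full data set.

```r
# First ten samples, copied from the structure listing earlier in the report.
redwine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlation coefficients between all columns.
cor_matrix <- cor(redwine, method = "pearson")

print(round(cor_matrix, 2))
```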

From above, we can draw conclusions:

Now we start to explore the relationships between pairs of variables. First, let us use scatter plots to analyze two variables at a time.

From the scatter plots above, we can see that only ‘alcohol’ and ‘volatile.acidity’ have a noticeable linear correlation with ‘quality’.

Here, let’s also explore the relationships between other pairs of variables with high correlation coefficients, excluding the pairs that contain ‘quality’.

It seems that ‘total.sulfur.dioxide’ and ‘free.sulfur.dioxide’ have a linear relationship.

## 
## Call:
## lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide, data = subset(redwine, 
##     total.sulfur.dioxide < quantile(redwine$total.sulfur.dioxide, 
##         0.9)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4478  -3.5173  -0.9483   3.2228  28.1375 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.211628   0.364571   8.809   <2e-16 ***
## total.sulfur.dioxide 0.296045   0.008251  35.881   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.843 on 1437 degrees of freedom
## Multiple R-squared:  0.4726, Adjusted R-squared:  0.4722 
## F-statistic:  1287 on 1 and 1437 DF,  p-value: < 2.2e-16

After dropping the largest 10% of the ‘total.sulfur.dioxide’ values, the model explains around 47% of the variance of ‘free.sulfur.dioxide’, according to the R-squared score.
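The trimming-and-fitting pattern in the model call above can be sketched as follows. This is an illustration on synthetic data, not the wine measurements: the data frame `toy`, its columns `total` and `free`, and the true slope of 0.3 are all assumptions made for the example.

```r
set.seed(42)

# Synthetic stand-in data (NOT the wine measurements): a long-tailed
# predictor and a response that depends on it linearly plus noise.
n <- 500
toy <- data.frame(total = 6 + rexp(n, rate = 1 / 40))
toy$free <- 3 + 0.3 * toy$total + rnorm(n, sd = 5)

# Keep only rows strictly below the 90th percentile of the predictor,
# mirroring the subset() call shown in the model output above.
cutoff  <- quantile(toy$total, 0.9)
trimmed <- subset(toy, total < cutoff)

# Simple linear regression on the trimmed data.
fit <- lm(free ~ total, data = trimmed)

# Share of the response variance explained by the fitted line.
r_squared <- summary(fit)$r.squared
print(coef(fit))
print(r_squared)
```

Trimming on the predictor removes the sparse right tail, so a few extreme points do not dominate the fitted slope.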

## 
## Call:
## lm(formula = pH ~ fixed.acidity, data = subset(redwine, fixed.acidity < 
##     quantile(redwine$fixed.acidity, 0.9)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49613 -0.06401  0.00121  0.06727  0.49291 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.916363   0.019230  203.65   <2e-16 ***
## fixed.acidity -0.073939   0.002406  -30.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1124 on 1431 degrees of freedom
## Multiple R-squared:  0.3976, Adjusted R-squared:  0.3972 
## F-statistic: 944.4 on 1 and 1431 DF,  p-value: < 2.2e-16

After dropping the largest 10% of the ‘fixed.acidity’ values, the model explains around 40% of the variance of ‘pH’, according to the R-squared score.

## 
## Call:
## lm(formula = pH ~ citric.acid, data = subset(redwine, citric.acid < 
##     quantile(redwine$citric.acid, 0.9)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50024 -0.07745 -0.00594  0.08267  0.58243 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.427575   0.005972  573.91   <2e-16 ***
## citric.acid -0.430289   0.021020  -20.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1303 on 1437 degrees of freedom
## Multiple R-squared:  0.2258, Adjusted R-squared:  0.2252 
## F-statistic:   419 on 1 and 1437 DF,  p-value: < 2.2e-16

So far, ‘pH’ appears less correlated with ‘citric.acid’ (R-squared about 0.23) than with ‘fixed.acidity’ (about 0.40). We also want to see how the ratio between the two is correlated with ‘pH’.

## 
## Call:
## lm(formula = pH ~ citric_over_fixed.acidity, data = subset(redwine, 
##     citric_over_fixed.acidity < quantile(redwine$citric_over_fixed.acidity, 
##         0.9)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50040 -0.08331 -0.00459  0.08673  0.60393 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.42333    0.00679  504.17   <2e-16 ***
## citric_over_fixed.acidity -3.90239    0.21261  -18.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1383 on 1437 degrees of freedom
## Multiple R-squared:  0.1899, Adjusted R-squared:  0.1894 
## F-statistic: 336.9 on 1 and 1437 DF,  p-value: < 2.2e-16

From the result above, it is reasonable to infer that the ratio ‘citric_over_fixed.acidity’ is not a valuable feature for a linear regression model predicting ‘pH’, because its linear correlation with ‘pH’ (R-squared about 0.19) is weaker than that of either component.

Secondly, we will use box plots to show the ‘quality’ distribution in different intervals of a variable's values.

The summary of ‘residual.sugar’ is listed below; we will divide its values into 3 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

## redwine$residual.sugar.bucket: (0.1,1.9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.599   6.000   8.000 
## -------------------------------------------------------- 
## redwine$residual.sugar.bucket: (1.9,2.5]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.66    6.00    8.00 
## -------------------------------------------------------- 
## redwine$residual.sugar.bucket: (2.5,16]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.637   6.000   8.000

The distribution of ‘quality’ does not change significantly across the groups defined by ‘residual.sugar’, which suggests that ‘residual.sugar’ may not influence ‘quality’ significantly.
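The bucketing step can be sketched with `cut()` and `by()`. The example below is a minimal sketch that reuses the break points visible in the bucket labels above, but only the first ten samples from the structure listing, so its per-bucket summaries will differ from those of the full data set.

```r
# First ten samples, copied from the structure listing earlier in the report.
redwine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Right-closed intervals with the same break points as the buckets above.
redwine$residual.sugar.bucket <- cut(
  redwine$residual.sugar,
  breaks = c(0.1, 1.9, 2.5, 16)
)

# Number of samples per bucket, then a summary of quality within each bucket.
bucket_counts <- table(redwine$residual.sugar.bucket)
print(bucket_counts)
print(by(redwine$quality, redwine$residual.sugar.bucket, summary))
```

Because `cut()` builds right-closed intervals by default, a value of exactly 1.9 falls into the first bucket, (0.1,1.9].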

The summary of ‘chlorides’ is listed below; we will divide its values into 4 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

## redwine$chlorides.bucket: (0,0.07]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.873   6.000   8.000 
## -------------------------------------------------------- 
## redwine$chlorides.bucket: (0.07,0.09]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.576   6.000   8.000 
## -------------------------------------------------------- 
## redwine$chlorides.bucket: (0.09,0.12]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.554   6.000   7.000 
## -------------------------------------------------------- 
## redwine$chlorides.bucket: (0.12,0.7]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    5.00    5.38    6.00    7.00

Based on this box plot, we can conclude that when ‘chlorides’ is low, from 0 to 0.07, the red wine quality tends to be higher. The first group has the highest mean ‘quality’ (5.87), and the median drops from 6 to 5 once ‘chlorides’ exceeds 0.09.

The summary of ‘total.sulfur.dioxide’ is listed below; we will divide its values into 3 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

## redwine$total.sulfur.dioxide.bucket: (5,22]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.748   6.000   8.000 
## -------------------------------------------------------- 
## redwine$total.sulfur.dioxide.bucket: (22,62]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.709   6.000   8.000 
## -------------------------------------------------------- 
## redwine$total.sulfur.dioxide.bucket: (62,289]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   5.000   5.374   6.000   8.000

From the box plot of ‘quality’ across the groups of ‘total.sulfur.dioxide.bucket’, we see that in the first two groups, (5,22] and (22,62], the red wine quality has a similar distribution, but once ‘total.sulfur.dioxide’ falls in (62,289], the mean and median of ‘quality’ decrease. This suggests that a relatively high amount of ‘total.sulfur.dioxide’ in red wine may decrease its quality.

The summary of ‘citric.acid’ is listed below; we will divide its values into 3 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## redwine$citric.acid.bucket: (-0.1,0.09]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.459   6.000   8.000 
## -------------------------------------------------------- 
## redwine$citric.acid.bucket: (0.09,0.42]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0     5.6     6.0     8.0 
## -------------------------------------------------------- 
## redwine$citric.acid.bucket: (0.42,1]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.887   6.000   8.000

The summary of ‘citric_over_fixed.acidity’ is listed below; we will divide its values into 3 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01292 0.03291 0.03084 0.04503 0.13930

## redwine$citric_over_fixed.acidity.bucket: (-0.1,0.013]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.462   6.000   8.000 
## -------------------------------------------------------- 
## redwine$citric_over_fixed.acidity.bucket: (0.013,0.045]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.578   6.000   8.000 
## -------------------------------------------------------- 
## redwine$citric_over_fixed.acidity.bucket: (0.045,0.14]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.928   7.000   8.000

Based on the two plots above, we noticed that the red wine quality increases with the amount of ‘citric.acid’. When the ratio between ‘citric.acid’ and ‘fixed.acidity’ falls into the highest interval, (0.045,0.14], the red wine quality tends to have a higher average level (mean 5.93, median 6).

The summary of ‘sulphates’ is listed below; we will divide its values into 3 intervals based on this summary.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

From this plot, we may infer that a higher sulphates concentration is associated with better quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Weak correlation:

The red wine quality index, ‘quality’, is not highly correlated with the environment attributes ‘pH’ and ‘residual.sugar’, which contradicts my intuition.

Medium correlation:

The red wine quality index, ‘quality’, is moderately correlated with the following attributes:

  • positive correlation: ‘fixed.acidity’, ‘citric.acid’, ‘sulphates’;

  • negative correlation: ‘volatile.acidity’, ‘chlorides’, ‘total.sulfur.dioxide’, ‘free.sulfur.dioxide’.

Strong correlation:

The red wine quality index, ‘quality’, is most strongly correlated with the variable ‘alcohol’. ‘quality’ tends to be higher when the alcohol content is higher.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, we find that ‘pH’ has negative correlation coefficients with ‘fixed.acidity’ and ‘citric.acid’ (both regression slopes above are negative), and also with ‘chlorides’.

What was the strongest relationship you found?

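Among the relationships involving ‘quality’, the strongest is with ‘alcohol’, which seems reasonable for red wine quality. Among the other pairs, ‘pH’ and ‘fixed.acidity’ show the clearest linear relationship (R-squared about 0.40 in the regression above).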

Multivariate Plots Section

In this section, we first want to explore the distribution of ‘quality’ over ‘chlorides’ further, since the scatter plot in the last section does not show the distribution clearly.

By the plots above, we notice that:

  • Figure 1: the two white dashed lines represent the 0.95 and 0.05 quantiles of quality, and the white solid line represents the median of quality. No trend stands out in this figure.

  • Figure 2: as the chlorides concentration goes up, the red wine tends to have a lower quality.

  • Figure 3: the results make more sense here than in the previous two figures. When the chlorides concentration is low, the red wine may be produced in a healthier way with fewer additives and hence has better quality (level 6 to level 7). When chlorides increase slightly, it should be easy to produce a medium level of quality (level 5 to level 6), which is probably the most common in daily life. When more chlorides are added, high quality becomes hard to reach; those samples might be produced with inferior technology.

Second, we want to further explore the relationship between the main variable ‘quality’ and the other variables whose correlation coefficients with it exceed 0.2 in absolute value.

This plot shows a trend: at the same alcohol content, red wine quality is better when the chlorides concentration is lower. We want to examine whether the pattern holds in different subsets of the whole data set.

As you can see, the pattern becomes clearer as ‘total.sulfur.dioxide’ goes up. Furthermore, in the subgroups (3,1), (3,2) and (3,3), the samples with high alcohol content and low chlorides concentration stand out with better quality. This observation suggests that when a large amount of sulfur dioxide is added to the red wine, less chlorides and more alcohol are associated with better quality. Meanwhile, if the sulfur dioxide concentration is low, other crucial variables determine the red wine quality. There is one more surprise here: as the sulphates concentration increases, the chlorides concentration tends to be higher. This observation requires more chemical research.

The scatter plot of ‘quality’ over ‘alcohol’ in different ‘sulphates’ intervals shows a pattern: a higher sulphates concentration may be associated with better quality. Let’s check whether this pattern holds in subsets of the whole sample.

This pattern holds in general, except for subset (1,1). In this subset we again confirmed the observation that when the sulfur dioxide and chlorides concentrations are low, the sulphates concentration does not influence the red wine quality. In addition, it appears that the sulfur dioxide concentration influences the red wine quality in a positive way.

In the second plot, we find that the pattern of quality over alcohol is not as clear as in the previous two patterns above. This can be explained by the first density plot. Notice that when the total sulfur dioxide concentration increases from a low level to a medium level, the red wine quality tends to be better. However, when it increases further from a medium level to a high level, the red wine quality drops again. It is reasonable to infer that some producers added too much sulfur dioxide to the red wine, so their product quality is not as good as others.

The patterns diverge in this plot in different subsets, but we still find some interesting trends.

  • In subset (1,1), where chlorides and sulphates both have low concentrations, the red wine samples with a high total sulfur dioxide concentration tend to have better quality. The reason might be that the amount of other additives is too small to give the red wine anti-microbial properties.

  • In the subsets in the 2nd, 3rd and 4th columns, we noticed that a high sulfur dioxide concentration always shows up in red wine with lower alcohol content. This may also be for anti-microbial purposes, since a low alcohol content does not perform well in killing harmful bacteria.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In this section, we explored the relationship between ‘quality’ and ‘alcohol’ along with three related variables: ‘sulphates’, ‘chlorides’ and ‘total.sulfur.dioxide’. First of all, we already know from the last section that ‘quality’ is positively correlated with ‘alcohol’. From the observations in this section, we noticed a trend: at the same alcohol content, red wine quality is better when the chlorides concentration is lower and when the sulphates concentration is higher.

Were there any interesting or surprising interactions between features?

For the variable ‘total.sulfur.dioxide’, we did not find a fixed pattern across the subsets, but when other additives are at a medium level, a high sulfur dioxide concentration is associated with bad quality. We also found that red wine with low alcohol content contains more sulfur dioxide, possibly because it contributes to the anti-microbial properties.

Another interesting point is that the red wine samples with high alcohol content and low chlorides concentration stand out with better quality. This observation suggests that when a large amount of sulfur dioxide is added to the red wine, less chlorides and more alcohol are associated with better quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

This plot explores the relationship between ‘quality’ and ‘chlorides’. Using the scatter plot in figure 1, we cannot find any trend, even after adding auxiliary lines to show the 0.95 quantile, 0.05 quantile and median of ‘quality’ over each ‘chlorides’ value, because the ‘quality’ variable is not continuous. In figure 2, we use box plots to show the descriptive statistics in each chlorides interval, which reveals a trend: as the chlorides concentration increases, the quality of red wine decreases. In figure 3, the plot shows the trend more clearly.

Plot Two

Description Two

We know that the correlation coefficient between alcohol and quality is 0.521, which is relatively high. In this plot, we see that the data points ‘follow’ the linear regression line well.

Plot Three

Description Three

This plot reveals different patterns of quality over alcohol in different subsets. We noticed that when the concentration of other additives or the alcohol content is low, most red wine samples contain a relatively large amount of sulfur dioxide.


Reflection

This exploratory data analysis is about a data set containing information on the ingredients of red wine samples. The data set has 1599 observations and 12 variables, not counting the index. In the initial phase, I explored each variable through univariate analysis. In this preliminary exploration, I noticed that some of the variables have long-tailed distributions. Hence I transformed the x-axis to a ‘log10’ scale and verified that the log-scale distribution is bell-like and symmetric. I noticed that quite a few red wine samples have a ‘citric.acid’ value equal to 0. This made me wonder whether citric acid is a good additive in red wine, and what about the other additives. Based on the correlation matrix, I decided to focus on exploring, in pairs, the variables whose correlation coefficients with ‘quality’ are higher than 0.2. Next I compared those pairs along with other variables together.

The main findings can be summarized as follows. The variables I am interested in are ‘quality’, ‘alcohol’, ‘sulphates’, ‘total.sulfur.dioxide’ and ‘chlorides’. The variable most correlated with ‘quality’ is ‘alcohol’, with a correlation coefficient of 0.521. The most interesting observation is that the total sulfur dioxide concentration plays a critical role in enhancing red wine quality. I noticed that red wines with a high total sulfur dioxide concentration do not need much chlorides to reach a high quality when the alcohol content is high. In addition, when other additives are not at a high level, many red wine samples contain a relatively large amount of sulfur dioxide and reach a medium or medium-to-high quality level. The sulphates and chlorides concentrations also influence the red wine quality, positively and negatively respectively.